Web Scraping Whoopsies

Manu Alcalá + Jameson Carter

2023-10-18

What is web scraping?

Broadly, it is the process of automatically pulling data from websites by reading their underlying code. Doing this gets complicated fast:

  • Static websites contain information in HTML code which does not change when you, the user, interact with it.
  • Dynamic websites have information in HTML code which does change as you interact with it.
  • Static and dynamic websites require different packages to scrape.
    • For example, Selenium is a common dynamic scraping library, while Scrapy is a common static scraping library.
    • This is because a static scraper only has to find what it is looking for in the HTML, while a dynamic scraper must simulate a human interacting with the site.
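The static case can be sketched with nothing but the standard library: the data is already in the HTML you download, so a plain parser suffices (a dynamic page would instead need Selenium to run the JavaScript that injects the data). The snippet and the `total` class name below are illustrative, not from any real site:

```python
from html.parser import HTMLParser

# Illustrative snippet standing in for a downloaded static page.
PAGE = """
<html><body>
  <h1>Enrollment Report</h1>
  <span class="total">1,234,567</span>
</body></html>
"""

class TotalExtractor(HTMLParser):
    """Collects the text inside any tag whose class is 'total'."""
    def __init__(self):
        super().__init__()
        self._in_total = False
        self.totals = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == "total":
            self._in_total = True

    def handle_endtag(self, tag):
        self._in_total = False

    def handle_data(self, data):
        if self._in_total and data.strip():
            self.totals.append(data.strip())

parser = TotalExtractor()
parser.feed(PAGE)
print(parser.totals)  # → ['1,234,567']
```

In practice you would fetch `PAGE` with an HTTP client and use a library like BeautifulSoup instead of hand-rolling a parser, but the point stands: for a static site, parsing the downloaded HTML is the whole job.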

When is it worth it?

  • Other data collection efforts are impossible or prone to human error
  • Getting new data releases quickly is valuable
  • Creating reproducible and reliable processes eases collaboration / quality assurance
  • Time spent coding < time spent obtaining data in other ways + time spent on quality assurance

What makes it hard?

  • It is difficult to predict how much time a web scraping task will take.

  • Sites might change, forcing you to update your code.

  • Site maintainers may not want their data scraped.

  • Sites might be removed or stop being maintained.

Example 1. Scraping Medicaid enrollment data

Why Medicaid enrollment data?

Since spring 2023, when the Public Health Emergency ended, states have been disenrolling Medicaid beneficiaries who no longer qualify.

Why are the data interesting?

In anticipation of “the great unwinding,” many states implemented policy changes to smooth the transition.

To understand the success of these policies, we wanted time-series enrollment data for all 50 states… from a Medicaid data system that is largely decentralized.

Unreadable PDFs abound!

An example from Louisiana

and another from Ohio

A sigh of relief…

Why page through PDFs when another organization’s RAs can do it for you?

1. Identify that this is a scrapeable dynamic page

One URL with data you can only get by clicking each option!

2. Confirm HTML actually contains the data

3. Code for 30 hours!

4. Bask in the glow of automated scraping
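The steps above can be sketched with Selenium: loop over every option in a dropdown, select it, and read the figure the page renders. The URL, element IDs, and CSS selectors below are hypothetical, not the real dashboard's:

```python
def parse_count(text):
    """Turn a displayed figure like '1,234,567' into an int."""
    return int(text.replace(",", "").strip())

def scrape_dashboard(url):
    # Selenium is imported here (lazily) so parse_count() can be used
    # and tested without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # 'state-select' and '.enrollment-total' are illustrative names.
        dropdown = Select(driver.find_element(By.ID, "state-select"))
        results = {}
        for option in dropdown.options:
            # If selecting triggers a page reload, re-locate the dropdown
            # each pass to avoid stale element errors.
            dropdown.select_by_visible_text(option.text)
            cell = driver.find_element(By.CSS_SELECTOR, ".enrollment-total")
            results[option.text] = parse_count(cell.text)
        return results
    finally:
        driver.quit()
```

The real scrape involved far more edge cases (hence the 30 hours), but the core loop — click each option, read the rendered value — is this small.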

Whenever new data were released over the following two months, I re-ran the code and got a well-formatted Excel file as output.

Little did I know, trouble was coming

What happened?

Two months later, KFF stopped updating the dashboard and changed how the existing data were reported on its graphs.

Example 2. Scraping Course Descriptions

Work Based Learning (WBL)

Work-based learning can include internships, clinicals, co-ops and other opportunities to gain experience in a work setting.

These opportunities are especially helpful to community college students.

We sought to create a national dataset of WBL prevalence by scraping course descriptions.

Course description data

  • No centralized source, need to scrape each school individually
  • No general pattern to where course descriptions can be found, how they are arranged, or what format they’re in

An example of course descriptions listed under department pages

And an example containing links to course catalogs in .pdf format

Are we doomed?

  • How do we find all the relevant links?
  • How can we target HTML elements when each website structures its code entirely differently?

Web Crawling with Scrapy

  • Scrapy is a web scraping framework designed for large-scale scraping projects.
  • It excels particularly at web crawling, i.e., traversing links and navigating websites according to specified rules.
  • It has several helpful features for respecting scraping restrictions, avoiding overwhelming servers, and pre-processing scraped data.
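Those politeness features mostly live in a project's `settings.py`. A minimal sketch, with illustrative values and a hypothetical pipeline module path:

```python
# Scrapy settings enabling the features noted above (tune per project).
ROBOTSTXT_OBEY = True               # respect each site's robots.txt rules
DOWNLOAD_DELAY = 1.0                # pause between requests to the same site
AUTOTHROTTLE_ENABLED = True         # back off automatically when servers slow down
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap simultaneous requests per site

ITEM_PIPELINES = {
    # Pipelines pre-process scraped items; this module path is hypothetical.
    "myproject.pipelines.CleanTextPipeline": 300,
}
```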

Initial Web Scraping Attempt

Starting from a homepage, we used Scrapy to follow links containing keywords like “catalog” or “course descriptions”

For each link, we scraped basic metadata and all the text present
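The link-selection rule can be sketched as a plain predicate — in Scrapy it would be expressed through a `LinkExtractor` in a crawl rule, but the logic is just keyword matching. The keyword list here is illustrative:

```python
# Keywords suggesting a link leads toward course descriptions (illustrative).
KEYWORDS = ("catalog", "course description", "courses")

def looks_relevant(url, link_text=""):
    """True if a link's URL or anchor text suggests course descriptions."""
    haystack = (url + " " + link_text).lower()
    return any(kw in haystack for kw in KEYWORDS)
```

Starting from each homepage, the crawler follows only links this predicate accepts, then scrapes metadata and text from the pages it reaches.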

Example output for one college

Whoopsie…

After a lot of hard work refining our approach, our data was a hot mess:

  • We did not get any data from about 1/3 of the schools
  • We couldn’t uniquely identify courses, which made it impossible to filter out duplicates
  • We couldn’t verify whether we had found all course descriptions

A different approach

  • We decided to lower our ambition and hone in on a single state instead

Disaster Averted

  • We used Selenium to click through each site until reaching a static page containing course descriptions.
  • We then used BeautifulSoup to parse specific HTML tags (which were standard across sites).
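Once Selenium lands on a static catalog page, its HTML can be handed to BeautifulSoup. The snippet and tag/class names below are illustrative of the kind of shared structure the sites had, not the actual markup:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for page_source from a catalog page.
CATALOG_HTML = """
<div class="courseblock">
  <p class="courseblocktitle">WBL 101 - Internship Experience</p>
  <p class="courseblockdesc">Supervised work-based learning placement.</p>
</div>
"""

soup = BeautifulSoup(CATALOG_HTML, "html.parser")
courses = [
    {
        "title": block.find("p", class_="courseblocktitle").get_text(strip=True),
        "description": block.find("p", class_="courseblockdesc").get_text(strip=True),
    }
    for block in soup.find_all("div", class_="courseblock")
]
```

Because the tags were standard across sites, one parser like this could handle every school in the state.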

In defense of Scrapy

  • Although Scrapy didn’t fit our use case, I highly recommend it as a fast, powerful tool when:
    • Websites follow a similar structure
    • The list of pages to be scraped is already known
    • There are a few websites to crawl through but a large number of pages to scrape (e.g., scraping reviews for different products)

Concluding remarks

Core questions to explore before you start

Availability of data

  • Are the data available through other routes?

  • Are the data produced by an organization that is invested in the problem long-term?

Frequency of scraping

  • Will I need to scrape the data multiple times?

  • What is the risk that the item scraped from the site will be changed?

Time-value tradeoffs

  • Is the time spent coding worth the payoff?

  • Will collecting data automatically save time on quality assurance?

Questions?

The remainder of the time is reserved for group discussion!

Thank you!

Please contact Manu Alcalá or Jameson Carter if you would like to discuss either of these projects or scope whether a use case is reasonable.